Goto

Collaborating Authors

 recency bias



Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models

Piskorz, Julianna, Pinneri, Cristina, Correia, Alvaro, Alfarra, Motasem, Garrepalli, Risheek, Louizos, Christos

arXiv.org Artificial Intelligence

Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMS, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens--required for generation--can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.


Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation

Neural Information Processing Systems

Continual learning entails learning a sequence of tasks and balancing their knowledge appropriately. With limited access to old training samples, much of the current work in deep neural networks has focused on overcoming catastrophic forgetting of old tasks in gradient-based optimization.



Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

Thamma, Abishek, Heilbron, Micha

arXiv.org Artificial Intelligence

As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.


LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns

Horowitz, Idan, Plonsky, Ori

arXiv.org Artificial Intelligence

We investigate the choice patterns of Large Language Models (LLMs) in the context of Decisions from Experience tasks that involve repeated choice and learning from feedback, and compare their behavior to human participants. We find that on the aggregate, LLMs appear to display behavioral biases similar to humans: both exhibit underweighting rare events and correlation effects. However, more nuanced analyses of the choice patterns reveal that this happens for very different reasons. LLMs exhibit strong recency biases, unlike humans, who appear to respond in more sophisticated ways. While these different processes may lead to similar behavior on average, choice patterns contingent on recent events differ vastly between the two groups. Specifically, phenomena such as ``surprise triggers change" and the ``wavy recency effect of rare events" are robustly observed in humans, but entirely absent in LLMs. Our findings provide insights into the limitations of using LLMs to simulate and predict humans in learning environments and highlight the need for refined analyses of their behavior when investigating whether they replicate human decision making tendencies.


ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Xiong, Jing, Shen, Jianghan, Zheng, Chuanyang, Wan, Zhongwei, Zhao, Chenyang, Yang, Chiwun, Ye, Fanghua, Yang, Hongxia, Kong, Lingpeng, Wong, Ngai

arXiv.org Artificial Intelligence

Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the attention sink phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76x, thereby achieving a 23.50x acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming powerful closed-source models such as Claude-2 and Kimi-Chat.


Analysis of LLM as a grammatical feature tagger for African American English

Porwal, Rahul, Rozet, Alice, Houck, Pryce, Gowda, Jotsna, Moeller, Sarah, Tang, Kevin

arXiv.org Artificial Intelligence

African American English (AAE) presents unique challenges in natural language processing (NLP). This research systematically compares the performance of available NLP models--rule-based, transformer-based, and large language models (LLMs)--capable of identifying key grammatical features of AAE, namely Habitual Be and Multiple Negation. These features were selected for their distinct grammatical complexity and frequency of occurrence. The evaluation involved sentence-level binary classification tasks, using both zero-shot and few-shot strategies. The analysis reveals that while LLMs show promise compared to the baseline, they are influenced by biases such as recency and unrelated features in the text such as formality. This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE's unique linguistic characteristics. Data and code are available.


Rope to Nope and Back Again: A New Hybrid Attention Strategy

Yang, Bowen, Venkitesh, Bharat, Talupuru, Dwarak, Lin, Hangyu, Cairuz, David, Blunsom, Phil, Locatelli, Acyr

arXiv.org Artificial Intelligence

Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architectural based on a hybrid attention mechanism that not only surpasses conventional RoPE-based transformer models in long context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths.


The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models

Yoshida, Lui

arXiv.org Artificial Intelligence

This study investigates the impact of example selection on the performance of au-tomated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regres-sion analysis is used to quantitatively assess biases introduced by example selec-tion. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more influenced by examples than GPT-4. We also find evidence of majority label bias, which is a tendency to favor the majority la-bel among the examples, and recency bias, which is a tendency to favor the label of the most recent example, in GPT-generated essay scores and QWK, with these biases being more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially in GPT-3.5 models, and highlight the need for individual performance evaluations of each model, even for minor versions.